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Abstract 

With the arrival of the R packages nlme and lme4, linear mixed mod¬ 
els (LMMs) have come to be widely used in experimentally-driven areas 
like psychology, linguistics, and cognitive science. This tutorial provides a 
practical introduction to fitting LMMs in a Bayesian framework using the 
probabilistic programming language Stan. We choose Stan (rather than 
WinBUGS or JAGS) because it provides an elegant and scalable frame¬ 
work for fitting models in most of the standard applications of LMMs. We 
ease the reader into fitting increasingly complex LMMs, first using a two- 
condition repeated measures self-paced reading study, followed by a more 
complex 2x2 repeated measures factorial design that can be generalized 
to much more complex designs. 


1 Introduction 

Linear mixed models, or hierarchical/multilevel linear models, have become the 
main workhorse of experimental research in psychology, linguistics, and cogni¬ 
tive science, where repeated measures designs are the norm. Within the pro¬ 
gramming environment R [53], the nlme package [55] and its successor, lme4 
[3] have revolutionized the use of linear mixed models (LMMs) due to their 
simplicity and speed: one can fit fairly complicated models relatively quickly, 
often with a single line of code. A great advantage of LMMs over traditional 
approaches such as repeated measures ANOVA and paired t-tests is that there 
is no need to aggregate over subjects and items to compute two sets of F-scores 
(or several t-scores) separately; a single model can take all sources of variance 
into account simultaneously. Furthermore, comparisons between conditions can 
easily be implemented in a single model through appropriate contrast coding. 
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Other important developments related to LMMs have been unfolding in 
computational statistics. Specifically, probabilistic programming languages like 
WinBUGS [IH], JAGS [53] and Stan [53, among others, have made it possible to 
fit Bayesian LMMs quite easily. However, one prerequisite for using these pro¬ 
gramming languages is that some background statistical knowledge is needed 
before one can define the model. This difficulty is well-known; for example, 
Spiegelhalter and colleagues (26] 4] write: “Bayesian statistics has a (largely 
deserved) reputation for being mathematically challenging and difficult to put 
into practice... ”. 

The purpose of this paper is to facilitate a first encounter with model spec¬ 
ification in one of these programming languages, Stan. The tutorial is aimed 
primarily at psychologists, linguists, and cognitive scientists who have used 
line4 to fit models to their data, but may have only a basic knowledge of the 
underlying LMM machinery. A diagnostic test is that they may not be able to 
answer some or all of these questions: what is a design matrix; what is contrast 
coding; what is a random effects variance-covariance matrix in a linear mixed 
model? Our tutorial is not intended for statisticians or psychology researchers 
who could, for example, write their own Markov Ghain Monte Carlo samplers in 
R or CH—h or the like; for them, the Stan manual is the optimal starting point. 
The present tutorial attempts to ease the beginner into their first steps towards 
fitting Bayesian linear mixed models. More detailed presentations about linear 
mixed models are available in several textbooks; references are provided at the 
end of this tutorial. 

We have chosen Stan as the programming language of choice (over JAGS 
and WinBUGS) because it is possible to fit arbitrarily complex models with 
Stan. For example, it is possible (if time consuming) to fit a model with 14 
fixed effects predictors and two crossed random effects by subject and item, 
each involving a 14 x 14 variance-covariance matrix [2]; as far as we are aware, 
such models cannot, as far as we know, be fit in JAGS or WinBUGSQ 

In this tutorial, we take it as a given that the reader is interested in learning 
how to fit Bayesian linear mixed models. We do not try to explain the advan¬ 
tages this approach affords beyond the classical frequentist approach; for such 
justification, the reader is directed to the rich literature relating to a comparison 
between Bayesian versus frequentist statistics (such as the provocatively titled 
paper by Lavine m-, and the highly accessible textbook by Kruschke [E]). We 
also assume that the reader is aware of Bayes’ Theorem, which for our pur¬ 
poses amounts to the statement that the posterior distribution is proportional 
to the prior times the likelihood. For the purposes of this paper, the goal of a 
Bayesian analysis is simply to derive the posterior distribution of each parame¬ 
ter of interest, given some data and prior beliefs about the distributions of the 
parameters. 

The tutorial is structured as follows. We begin by successively building up 
increasingly complex LMMs using the data-set reported by Gibson and Wu[5], 

^Whether it makes sense in general to fit such a complex model is a different issue; see [2, 
and [2] for recent discussion. 
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which has a simple two-condition design. At each step, we explain the structure 
of the model. The next section takes up inference for this two-condition design. 
Then we demonstrate how one can fit a somewhat more complex 2x2 factorial 
design. 

This paper was written using a literate programming tool, knitr |29] : this 
integrates documentation for the accompanying code with the paper. The knitr 
file that generated this paper, as well as all the code and data used in this 
tutorial, can be downloaded from our website: 

http://www.ling.uni-potsdam.de/'^vasishth/statistics/BayesLMMs.html 
We start with the two-condition repeated measures data-set [5] as a concrete 
running example. This simple example serves as a starter kit for fitting com¬ 
monly used LMMs in the Bayesian setting. We assume that the reader has the 
relevant software installed; specifically, rstan in R. For detailed instructions, 
see 

https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started 


2 Example 1: A two-condition repeated mea¬ 
sures design 

This section motivates the LMM with the self-paced reading data-set of Gibson 
and Wu [S]. We introduce the data-set, state our modeling goals here, and 
proceed to build up increasingly complex LMMs. 

The scientific question Subject and object relative clauses have been widely 
used in reading studies to investigate sentence comprehension processes. A sub¬ 
ject relative is a sentence like The senator who interrogated the journalist re¬ 
signed where a noun {senator) is modified by a relative clause {who interrogated 
the journalist) , and the modified noun is the grammatical subject of the rela¬ 
tive clause. In an object relative, the noun modified by the relative clause is the 
grammatical object of the relative clause (e.g.. The senator who the journalist 
interrogated resigned). In both cases, the noun that is modified {senator) is 
called the head noun. 

A typical finding for English is that subject relatives are easier to process 
than object relatives [12]. Natural languages generally have relative clauses, 
and the subject relative advantage has until recently been considered to be true 
cross-linguistically. However, Ghinese relative clauses apparently represent an 
interesting counter-example to this generalization; recent work by Hsiao and 
Gibson m has suggested that in Ghinese, object relatives are easier to process 
than subject relatives at a particular point in the sentence (the head noun of 
the relative clause). We now present an analysis of a subsequently published 
data-set jS] that evaluates this claim. 

The data The dependent variable of the experiment of Gibson and Wu jS] was 
the reading time rt of the head noun of the relative clause. This was recorded 
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in two conditions (subject relative and object relative), with 37 subjects and 

15 items, presented in a standard Latin square design. There were originally 

16 items, but one item was removed, resulting in 37 x 15 = 555 data points. 
However, eight data points from one subject (id 27) were missing. As a conse¬ 
quence, we have a total of 555 — 8 = 547 data points. The first few lines from 
the data frame are shown in Table [T] “o” refers to object relative and “s” to 
subject relative. 


row 

subj 

item 

SO 

rt 

1 

1 

13 

0 

1561 

2 

1 

6 

s 

959 

3 

1 

5 

0 

582 

4 

1 

9 

0 

294 

5 

1 

14 

s 

438 

6 

1 

4 

s 

286 

547 

9 

11 

0 

350 


Table 1: First six rows, and the last row, of the data-set of Gibson and Wu 
(2013), as they appear in the data frame. 

We build up the Bayesian LMM from a hxed effects model to a varying 
intercepts model and finally to a varying intercepts, varying slopes model (the 
“maximal model” of Barr and colleagues my The result is a probability model 
that expresses how the dependent variable, the reading time labeled rt, was 
generated in the experiment of Gibson and Wu [8]. 

As mentioned above, the goal of Bayesian modeling is to derive the posterior 
probability distribution of the model parameters from a prior probability distri¬ 
bution and a likelihood function. Stan makes it easy to compute this posterior 
distribution of each parameter of interest. The posterior distribution reflects 
what we should believe, given the data, regarding the value of that parameter. 

2.1 Fixed Effects Model (Simple Linear Model) 

We begin by making the working assumption that the dependent variable of 
reading time rt on the head noun is approximately log-normally distributed [25] . 
This assumes that the logarithm of rt is approximately normally distributed. 
The logarithm of the reading times, logrt, has some unknown grand mean 
/3o- The mean of the log-normal distribution of rt is the sum of /3o and an 
adjustment /3i x so whose magnitude depends on the categorical predictor so, 
which has the value —1 when rt is from the subject relative condition, and -|-1 
when rt is from the object relative condition. One way to write the model in 
terms of the logarithm of the reading times is as follows: 

logrti =/3o-I-Asoj-|-£i (1) 
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The index i represents the i-th row in the data-frame (in this case, i = 
1,..., 547); the term £i represents the error in the i-th row. With the above ±1 
contrast coding, /3o represents the grand mean of logrt, regardless of relative 
clause type. It can be estimated by simply taking the grand mean of logrt. 
The parameter j3i is an adjustment to /?o so that the mean of logrt is /3o + l/3i 
when logrt is from the object relative condition, and /3o — l/3i when logrt is 
from the subject relative condition. Notice that 2 x /3i will be the difference in 
the means between the object and subject relative clause conditions. Together, 
/3o and /3i make up the part of the model which characterizes the effect of the 
experimental manipulation, relative clause type (so), on the dependent variable 
rt. We call this a fixed effects model because we estimate the (3 parameters, 
which are unvarying from subject to subject and from item to item. In R, this 
would correspond to fitting a simple linear model using the Im function, with 
so as predictor and logrt as dependent variable. 

The error e is positive when log rt^ is greater than the expected value 
Mi = /3o + Pisoi and negative when logrt^ is less than the expected value 
Thus, the error is the amount by which the expected value differs from actually 
observed value. It is standardly assumed that the £i are independently and 
identically distributed as a normal distribution with mean zero and unknown 
standard deviation Ce ■ Stan parameterizes the normal distribution by the mean 
and standard deviation, and we follow that convention here, by writing the dis¬ 
tribution of £ as N(0, tTe) (the standard notation in statistics is in terms of mean 
and variance). A consequence of the assumption that the errors are identically 
distributed is that the distribution of e should, at least approximately, have the 
same shape as the normal distribution. Independence implies that there should 
be no correlation between the errors—this is not the case in the data, since we 
have multiple measurements from each subject, and from each item. 

Setting up the data We now fit the fixed effects Model [TJ For the following 
discussion, please refer to the code in Listings [T] and [2] in the appendix. First, 
we read the Gibson and Wu [S] data into a data frame rDat in R, and then 
subset the critical region (Listing [H lines 2 and 4). Next, we create a data list 
stanDat for Stan, which contains the data (Listing I, line 7). Stan requires the 
data to be of type list; this is different from the Im and Imer functions, which 
assume that the data are in of type data-frame. 

Defining the model The next step is to write the Stan model in a text file 
with extension . stan. A Stan model consists of several blocks. A block is a 
set of statements surrounded by brackets and preceded by the block name. We 
open up a file fixEf .stan in a text editor and write down the first block, the 
data block, which contains the declaration of the variables in the data object 
StanDat (Listing [2l lines 1-5). The strings real and int specify the data type 
for each variable. A real variable is a real number, and an int variable is an 
integer. For instance, N is the integer number of data points. The variables so 
and rt are arrays of length N whose entries are real. We constrain a variable 
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to take only a subset of the values allowed by its type (e.g. int or real) by 
specifying in brackets lower and upper bounds (e.g. <lower=-l ,upper=l>). 

Next, we turn to the parameters block, where the parameters are defined 
(Listing [21 lines 6-9). The fixed effects Model [T] has three parameters: the fixed 
intercept /3o, the fixed slope /3i, and the standard deviation ae of the error. The 
fixed effects and fii are in the vector beta of length two; note that although 
we called our parameters and /3i in Model [T1 in Stan, these are contained in 
a vector with indices 1 and 2, so /3o is in betafl] and fii in beta [2]. The third 
parameter, the standard deviation Ue of the error (sigma_e), is also defined here, 
and is constrained to have lower bound 0 (Listing |21 line 8). Finally, the model 
block specifies the prior distribution and the likelihood (Listing |21 lines 10-15). 

To understand the Stan syntax, compare the Stan code above to the spec¬ 
ification of Model |T] The Stan code literally writes out this model. The block 
begins with a local variable declaration for mu, which is the mean of rt con¬ 
ditional on whether so is —1 for the subject relative condition or -|-1 for the 
object relative condition. 

The prior distributions on the parameters beta and sigma_e would ordinarily 
be declared in the model block. If we don’t declare any prior, it is assumed that 
they have a uniform prior distribution. Note that the distribution of sigma_e is 
truncated at zero because sigma_e is constrained to be positive (see the decla¬ 
ration real<lower=0> sigma_e; in the parameters block). So, this means that 
the error has a uniform prior with lower bound 0. 

In the model block, the for-loop assigns to mu the mean for the log-normal 
distribution of rt [i], conditional on the value of the predictor so [i] for relative 
clause type. The statement rt [i] ~ lognormal (mu, sigma_e) means that the 
logarithm of rt is normally distributed with mean mu and standard deviation 
sigma_e. One could have equally well log-transformed the reading time and 
assumed a normal distribution instead of the lognormal. 

Running the model We save the file f ixEf . stan which we just wrote and fit 
the model in R with the function stan from the package rstan (Listing [U lines 
9 and 10). This call to the function stan will compile a C-I--I- program which 
produces samples from the joint posterior distribution of the fixed intercept /?o, 
the fixed slope /3i, and the standard deviation Ue of the error. Here, the function 
generates four chains of samples, each of which contains 2000 samples of each 
parameter. Samples 1 to 1000 are part of the warmup, where the chains settle 
into the posterior distribution. We analyze samples 1001 to 2000. The result is 
saved to an object fixEfFit of class StanFit. 

Evaluating model convergence and summarizing resnlts The first step 
after running the above function should be to look at the trace plot of each 
chain after warmup, using the command shown in Listing [T] lines 13 and 14. 
A trace plot has the chains plotted against the sample ID. In Figure [U we see 
four different chains plotted against sample number going from 1001 to 2000. If 
the trace plot looks like a “fat, hairy caterpillar” m which does not bend, this 
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Trace of beta[1] 



1000 1200 1400 1600 1800 2000 

Iterations {without warmup) 

Trace of beta[2] 



1000 1200 1400 1600 1800 2000 

Iterations {without warmup) 

Trace of sigma__e 



1000 1200 1400 1600 1800 2000 

Iterations {without warmup) 


Figure 1: Trace plots of the fixed intercept /3o, the fixed slope /3i, and the 
standard deviation Ue of the error for the hxed effects Model [T] 


suggests that the chains have converged to the posterior distribution. 

The second diagnostic which we use to assess whether the chains have con¬ 
verged to the posterior distribution is the statistic Rhat. Each parameter has the 
Rhat statistic associated with it [7] ; this is essentially the ratio of between-chain 
variance to within-chain variance (analogous to ANOVA). The Rhat statistic 
should be approximately 1 ± 0.1 if the chain has converged. This is shown in 
the rightmost column of the model summary, see Table [2l 

Having satisfied ourselves that the chains have converged, we turn to examine 
this posterior distribution. (If there is an indication that convergence has not 
happened, then, assuming that the model has no errors in it, increasing the 
number of samples usually resolves the issue.) 


parameter 

mean 

2.5% 

97.5% 

R 

/3o 

6.06 

0.03 

6.11 

1 

/3i 

-0.04 

-0.09 

0.01 

1 

de 

0.60 

0.56 

0.64 

1 


Table 2: Examining the credible intervals and the R-hat statistic in the Gibson 
and Wu data. 

The result of fitting the fixed effects Model[T]is the joint posterior probability 
distribution of the parameters /Iq, /3i, and Ug. The distribution is joint because 
each of the (4 chains x 1000 post-warmup iterations =)4000 posterior samples 
which the call to stan generates is a vector 9 = (/Ig,/3i,Ue)^ of three model 
parameters. Thus, the object f ixEf Fit contains 4000 parameter vectors 9 which 
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Figure 2: Bivariate joint posterior probability distribution of each element of 
6 with each other element (lower diagonal) and marginal posterior probability 
distribution of each element of 9 separately (diagonal). All parameters are on 
the log scale, but note the difference in length scale between /3i on the one hand 
and Po and erg on the other. 


occupy a three dimensional space. Already in three dimensions, the posterior 
distribution becomes difficult to view in one graph. Figure [5] displays the joint 
posterior probability distribution of the elements of 9 by projecting it down onto 
planes. In each of the three planes (lower triangular scattergrams) we see how 
one parameter varies with respect to the other. In the diagonal histograms, 
we visualize the marginal probability distribution of each parameter separately 
from the other parameters. 

Of immediate interest is the marginal distribution of the slope /3i. Figure [2] 
suggests that the posterior probability density of /3i is mainly spread over the 
interval (—oo,0). One quantitative way to assess the posterior probability dis¬ 
tribution is to examine its quantiles; see Table [H Here, it is useful to define 
the concept of the credible interval. The (1 — a)% credible interval contains 
(1 — a)% of the posterior probability density. Unlike the (1 — a)% confidence 
interval from the frequentist setting, the (1 — a)% credible interval represents 
the range within which we are (1 — a)% certain that the true value of the pa¬ 
rameter lies, given the prior and the data (see [21] for further discussion on CIs 
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vs credible intervals). A common convention is to use the interval ranging from 
the 2.5 to 97.5 percentiles. We follow this convention and 95% credible intervals 
in Tabled 

The samples of /3i suggests that approximately 94% of the posterior proba¬ 
bility density is below zero, suggesting that there is some evidence that object 
relatives are easier to process than subject relatives in Chinese, given the Gib¬ 
son and Wu data. However, since the 95% credible interval includes 0, we may 
be reluctant to conclude that object relatives are easier to process. We will say 
more about the evaluation of research hypotheses further on. 

2.2 Varying Intercepts Mixed Effects Model 

The fixed effects Model[T]is inappropriate for the Gibson and Wu data because it 
does not take into account the fact that we have multiple measurements for each 
subject and item. As mentioned above, these multiple measurements lead to a 
violation of the independence of errors assumption. Moreover, the fixed effects 
coefficients /3o and Pi represent means over all subjects and items, ignoring the 
fact that some subjects will be faster and some slower than average; similarly, 
some items will be read faster than average, and some slower. 

In linear mixed models, we take this by-subject and by-item variability into 
account by adding adjustment terms u^j and WQk, which adjust /3o for subject j 
and item k. This partially decomposes £i into a sum of the terms u^j and icofci 
which are adjustments to the intercept /3o for the subject j and item k associated 
with rt j. If subject j is slower than the average of all the subjects, Uj would be 
some positive number, and if item k is read faster than the average reading time 
of all the items, then Wk would be some negative number. Each subject j has 
their own adjustment uoj, and each item its own wok- These adjustments uoj 
and wok are called random intercepts by Pinheiro and Bates [22] and varying 
intercepts by Gelman and Hill [6], and by adjusting /3o by these we account for 
the variability between speakers, and between items. 

It is standardly assumed that these adjustments are normally distributed 
around zero with unknown standard deviation; mq ~ N(0, tT„) and wq ^ N(0, CTu,); 
the subject and item adjustments are also assumed to be mutually independent. 
We now have three sources of variance in this model: the standard deviation of 
the errors Ue, the standard deviation of the by-subject random intercepts ( 7 „, 
and the standard deviation of the by-item varying intercepts ctu,. We will refer 
to these as variance components. 

We now express the logarithm of reading time, which was produced by sub¬ 
jects j = I,..., 37 reading items fc = I,..., 15, in conditions i = 1,2 {1 refers to 
subject relatives, 2 to object relatives), as the following sum. Notice that we are 
now using a slightly different way to describe the model, compared to the fixed 
effects model. We are using indices for subject, item, and condition to identify 
unique rows. Also, instead of writing /3iso, we index /3i by the condition i. This 
follows the notation used in the textbook on linear mixed models, written by 
the authors of nlme [22], the precursor to lme4. 
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log rtyfe — /3o + Pli + Uoj + Wok + £ijk (2) 

Model [2] is an LMM, and more specifically a varying intercepts model. The 
coefficient jdu is the one of primary interest; it will have some mean value —/3i 
for subject relatives and +/3i for object relatives due to the contrast coding. So, 
if our posterior mean for f3i is negative, this would suggest that object relatives 
are read faster than subject relatives. 

We fit the varying intercepts Model [2] in Stan in much the same way as the 
fixed effects Model [T] For the following discussion, please consult Listing [3] for 
the R code used to run the model, and Listing U for the Stan code. 

Setting up the data The data which we prepare for passing on to the func¬ 
tion Stan now includes subject and item information (Listing[3l lines 2-8). The 
data block in the Stan code accordingly includes the number J, K of subjects 
and items, respectively; and the variable N records the number of rows in the 
data frame. 

Defining the model ModeljH shown in Listing^ still has the fixed intercept 
/3o, the fixed slope /3i, and the standard deviation Ce of the error, and we specify 
these in the same way as we did for the fixed effects Model [T] In addition, the 
varying intercepts Model[2]has by-subject varying intercepts u^j for j = 1,..., J 
and by-item varying intercepts wofc for k = 1,... ,K. The standard deviation 
of uo is au and the standard deviation of wo is aw We again constrain the 
standard deviations to be positive. 

The model block places normal distribution priors on the varying inter¬ 
cepts uq and wq. We implicitly place uniform priors on sigma_u, sigma_w, and 
sigma_e by omitting them from the model block. As pointed out earlier for 
sigma_e, these prior distributions have lower bound zero because of the con¬ 
straint <lower=0> in the variable declarations. 

The statement about how each row in the data is generated is shown in 
Listing m lines 26-29; here, both the fixed effects and the varying intercepts for 
subjects and items determine the expected value mu. The vector u has vary¬ 
ing intercepts for subjects. Likewise, the vector w has varying intercepts for 
items. The for-loop in lines 26-29 now adds u[subj [i]] + w[item[i]] to the 
mean beta [1] of the distribution of rt [i]. These are subject- and item-specific 
adjustments to the fixed-effects intercept beta[l]. The term u[subj [i]] iden¬ 
tifies the id of the subject for row i in the data-frame; thus, if i = 1, then 
subj [1]=1, and item[l]=13 (see TableH]). 

Running the model We pass the list stanDat of data to stan, which com¬ 
piles a C-l—I- program to sample from the posterior distribution of Model [21 
Stan samples from the posterior distribution of the model parameters, includ¬ 
ing the varying intercepts uoj and wok for each subject j = 1,..., J and item 
k = l,...,K. 
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It may be helpful to rewrite the model in mathematical form following the 
Stan syntax m use a similar notation); the Stan statements are slightly different 
from the way that we expressed Model [2J Dehning i as the row id in the data, 
i.e., i = 1,..., 547, we can write: 


Likelihood : 

— /5o 4" '^[suhj[i]\ 4” '^litem[i]] 4“ /?! X SOi 
rti LogNormal(/Xi, (Te) 

Priors : (3) 

u ^ Normal(0, cr^) w ^ Normal(0, a^) 

(Te; cTu: cTu) Uniform(0, oo) 

P ~ Uniform(—oo, oo) 

Here, notice that the i-th row in the statement for /r identifies the subject 
id ranging from j = 1,..., 37, and the item id ranging from fc = 1,..., 15. 

Summarizing the results The posterior distributions of each of the param¬ 
eters is summarized in Tabled The R values suggest that model has converged. 
Note also that compared to Model 1, the estimate of is smaller; this is be¬ 
cause the other two variance components are now being estimated as well. Note 
that the 95% credible interval for the estimate /3i includes 0; thus, there is some 
evidence that object relatives are easier than subject relatives, but we cannot 
exclude the possibility that there is no difference in the reading times between 
the two relative clause types. 


parameter 

mean 

2.5% 

97.5% 

R 

/3o 

6.06 

5.92 

6.20 

1 

/3i 

-0.04 

-0.08 

0.01 

1 

de 

0.52 

0.49 

0.55 

1 


0.25 

0.19 

0.34 

1 

^ w 

0.20 

0.12 

0.32 

1 


Table 3: The quantiles and the R statistic in the Gibson and Wu data, the 
varying intercepts model. 


2.3 Varying Intercepts, Varying Slopes Mixed Effects Model 

Consider now that subjects who are faster than average (i.e., who have a nega¬ 
tive varying intercept) may exhibit greater slowdowns when they read subject 
relatives compared to object relatives. Similarly, it is in principle possible that 
items which are read faster (i.e., which have a large negative varying intercept) 
may show a greater slowdown in subject relatives than object relatives. The 


11 





opposite situation could also hold; faster subjects may show smaller SR-OR 
effects, or items read faster may show smaller SR-OR effects. Although such 
individual-level variability was not of interest in the original paper by Gibson 
and Wu, it could be of theoretical interest (see, for example, [IS])- Furthermore, 
as Barr and colleagues [I] point out, it is in principle desirable to include a fixed 
effect factor in the random effects as a varying slope if the experiment design is 
such that subjects see both levels of the factor (cf. [2]). 

In order to express this structure in the LMM, we must make two changes 
in the varying intercepts Model [H 

Adding varying slopes The first change is to let the size of the effect for 
the predictor so vary by subject and by item. The goal here is to express that 
some subjects exhibit greater slowdowns in the object relative condition than 
others. We let effect size vary by subject and by item by including in the model 
by-subject and by-item varying slopes which adjust the fixed slope /3i in the 
same way that the by-subject and by-item varying intercepts adjust the fixed 
intercept /3o- This adjustment of the slope by subject and by item is expressed 
by adjusting j3i by adding two terms uij and wik- These are random or varying 
slopes^ and by adding them we account for how the effect of relative clause type 
varies by subject j and by item k. We now express the logarithm of reading 
time, which was produced by subject j reading item k, as the following sum. 
The subscript i indexes the conditions. 

log rty fc = /3o + y-oj + wok + /3i + uuj + wuk +£ijk (4) 

'-V-' '-V-' 

varying intercepts varying slopes 


Defining a variance-covariance matrix for the random effects The 

second change which we make to Model [2] is to define a covariance relationship 
between by-subject varying intercepts and slopes, and between by-items inter¬ 
cepts and slopes. This amounts to adding an assumption that the by-subject 
slopes ui could in principle have some correlation with the by-subject intercepts 
uq; and by-item slopes wi with by-item intercept wq. We explain this in detail 
below. 

Let us assume that the adjustments Uq and Ui are normally distributed 
with mean zero and some variances and cr^i, respectively; also assume 
that Uq and ui have correlation It is standard to express this situation 
by defining a variance-covariance matrix (sometime this is simply called a 
variance matrix). This matrix has the variances of uq and ui respectively along 
the diagonals, and the covariances on the off-diagonals. (The covariance between 
two variables X, Y, Cov(X,Y) is defined as the product of their correlation p 
and their standard deviations ax and cry: Cov(X,Y) = paxay-) 

_ ( ^uO Pu^uO^ul 

— 2 
\Pu^uO(^ul ^ul 



12 




Similarly, we can define a variance-covariance matrix for items, using the 
standard deviations a-^o, cr^i, and the correlation p^. 

_ f *^uiO PwO'wO^wl 

— 2 
\Pw^w0^wl ^wl 

The standard way to express this relationship between the subject intercepts 
uo and slopes ui, and the item intercepts wg and slopes wi, is to define a 
bivariate normal distribution as follows: 



An important point to notice here is that any nx n variance-covariance ma¬ 
trix has associated with it an n x n correlation matrix. In the subject variance- 
covariance matrix the correlation matrix is 


CL "r) ® 

In a correlation matrix, the diagonal elements will always be 1, because a 
variable always has a correlation of 1 with itself. The off-diagonals will have 
the correlations between the variables. Note also that, given the variances cr^o 
and a'^i, we can always recover the variance-covariance matrix, if we know 
the correlation matrix. This is because of the above-mentioned definition of 
covariance. 

A correlation matrix can be decomposed into a square root of the matrix, 
using the Cholesky decomposition. Thus, given a correlation matrix C, we can 
obtain its square root L; an obvious consequence is that we can square L to get 
the correlation matrix C back. This is easy to illustrate with a simple example. 
Suppose we have a correlation matrix: 


C = 


I 

-0.5 



(9) 


We can use the Cholesky decomposition function in R, chol, to derive the 
lower triangular square root L of this matrix. This gives us: 


L = 



0 

0.8660254 


( 10 ) 


We can confirm that this is a square root by multiplying L with itself to 
get the correlation matrix back (squaring a matrix is done by multiplying the 
matrix by its transpose): 


ALT = 


I 

-0.5 


0 

0.8660254 


-0.5 \ 

0.8660254^ 



( 11 ) 


The reason that we bring up the Cholesky decomposition here is that we 
will use it to generate the by-subject and by-item adjustments to the intercept 
and slope fixed-effects parameters. 
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Figure 3: Uncorrelated random variables z = (zi,Z 2 )^ (left) and correlated 
random variables x = (xi,X 2 y (right). 


Generating correlated random variables using the Cholesky decompo¬ 
sition The by-subject and by-item adjustments are generated using the follow¬ 
ing standard procedure for generating correlated random variables x = (xi,X 2 )' 

1. Given a vector of standard deviances (e.g., (Tuo,crui), create a diagonal 
matrix: 

2. Premultiply the diagonalized matrix r with the Cholesky decomposition 
L of the correlation matrix C to get a matrix A. 

3. Generate values from a random variable z = ( 2 : 1 , 22 )^, where Zi and Z2 
each have independent N(0,1) distributions (left panel of Figure [3]). 

4. Multiply A with z; this generates the correlated random variables x (right 
panel of Figure |3|). 

This digression into Cholesky decomposition and the generation of correlated 
random variables is important to understand for building the Stan model. We 
will define a vague prior distribution on L, and a vague prior on the standard 
deviances. This allows us to generate the by-subject and by-item adjustments 
to the hxed effects intercepts and slopes. 

Defining the model With this background, implementing the varying in¬ 
tercepts, varying slope Model |4] is straightforward; see Listing [6] for the code. 
The data block is the same as before. The parameters block contains several 
new parameters. This time, we have vectors sigma_u and sigma_w which are 
(tT„o, CTui)^ and {(Tujo, , instead of scalar values as in Model[21 The variables 
L_u, L_w, z_u, and z_w, which have been declared in the parameters block, play 
a role in the transformed parameters block, a block which we did not use in the 
earlier models. The transformed parameters block generates the by-subject and 
by-item varying intercepts and slopes using the parameters sigma_u, sigma_w, 
L_u, L_w, z_u, and z_w. The J pairs of by-subject varying intercepts and slopes 
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are in the rows of the J x 2 matrix u, and the K pairs of by-item varying 
intercepts and slopes are in the rows of the K x 2 matrix w. 

These varying intercepts and slopes are obtained through the statements 
diag_pre_multiply(sigmaji, L_u) * z_u and diag_pre_multiply(sigmajj, 
L_w) * z_w. This statement generates varying intercepts and slopes from the 
joint probability distribution of equation [71 The parameters L_u, L_w are the 
Cholesky decompositions of the subject and item correlation matrices, respec¬ 
tively, and z_u, and z_w are N(0,1) random variables. 

It is helpful to walk through steps 1 to 4 involved in generating the varying 
intercepts and slopes using the procedure described above for generating cor¬ 
related random variables. The statement diag_pre_multiply(sigma_u,L_u) * 
z_u computes the transpose matrix product (steps 1 and 2). The right multipli¬ 
cation of this product by z_u, a matrix of normally distributed random variables 
(step 3), yields the varying intercepts and slopes (step 4). 


/ Uoi 

Ull\ 

U02 

U12 

\uoj 

UljJ 


(diag((T^o ^ ^ui 


f f CuO ^ ^ ^ f 

i \ 0 CToiy \^^21 ^22/ \221 Z22 


(13) 



T 


Turning to the model block, here, we place priors on the parameters declared 
in the parameters block, and define how these parameters generate log rt (List¬ 
ing | 6 l lines 30-42). The definition of the prior L_u ^ lkj_corr_cholesky (2.0) 
implicitly places a so-called Ikj prior with shape parameter 77 = 2.0 on the 
correlation matrices 



(14) 


where is the correlation between the by-subject varying intercept duo and 
slope (T„i (cf. the covariance matrix of Equation |S|) and pu, is the correlation 
between the by-item varying intercept a^o and slope The Ikj distribution 
with shape parameter 77 = 1.0 is a uniform prior over all 2 x 2 correlation 
matrices; it scales up to larger correlation matrices. The parameter 77 has an 
effect on the shape of the Ikj distribution. Our choice of 77 = 2.0 implies that 
the correlations in the off-diagonals are near zero, reflecting the prior belief that 
there is no correlation between intercepts and slopes. 

The statement to_vector(z_u) ^ normal(0,l) places a normal distribu¬ 
tion with mean zero and standard deviation one on z_u. The same goes for z_w. 
The for-loop assigns to mu the mean of the log-normal distribution from which 
we draw rt[i], conditional on the value of the predictor so[i] for relative 
clause type and the subject and item identity. 
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We can now fit the varying intercepts, varying slopes Model SJ see Listing 0 
for the code. We see in the model summary in Table |4] that the model has 
converged, and that the credible intervals of the parameter of interest, ,5i, still 
include 0. In fact, the posterior probability of the parameter being less than 0 
is now 90% (this information can be extracted as shown in Listing [SJ lines 6-8). 


parameter 

mean 

2.5% 

97.5% 

R 

/3o 

6.06 

5.92 

6.21 

1 

/3i 

-0.04 

-0.09 

0.02 

1 

de 

0.51 

0.48 

0.55 

1 

^uO 

0.25 

0.19 

0.34 

1 

^ul 

0.07 

0.01 

0.14 

1 


0.20 

0.13 

0.32 

1 

^w\ 

0.04 

0.0 

0.10 

1 


Table 4: The quantiles and the R statistic in the Gibson and Wu data, the 
varying intercepts, varying slopes model. 

Figure |4] plots the varying slope’s posterior distribution against the varying 
intercept’s posterior distribution for each subject. The correlation between uq 
and ui is negative, as captured by the marginal posterior distributions of the 
correlation between uq and ui. Thus, Figure H] suggests that the slower a 
subject’s reading time is on average, the slower they read object relatives. In 
contrast. Figure 0] shows no clear pattern for the by-item varying intercepts and 
slopes. We briefly discuss inference next. 


3 Inference 

Having fit a varying intercepts, varying slopes ModelSl we now explain one way 
to carry out statistical inference, using credible intervals. We have used this 
approach to draw inferences from data in previously published work (e.g., [4], 
E)- There are of course other approaches possible for carrying out inference. 
Bayes Factors are an example; see Lee and Wagenmakers [16]. Another is 
to define a Region of Practical Equivalence [14]. The reader can choose the 
approach they find the most appealing. 

3.1 Inference nsing credible intervals 

The result of fitting the varying intercepts, varying slopes Model 0] is the pos¬ 
terior distribution of the model parameters. As mentioned above in connection 
with Models 1-3, direct inference from the posterior distributions is possible. 
For instance, we can find the posterior probability with which the fixed inter¬ 
cept /3i or the correlation between by-subject varying intercepts and slopes 
take on any given value by consulting the marginal posterior distributions whose 
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Figure 4: The top row shows the relationship between the varying slopes (y- 
axis) and intercepts (x-axis) for each subject (left panel) and item (right panel). 
The bottom row shows the posterior distribution of the parameter of correlation 
between the varying slopes and intercepts for each subject (left panel) and item 
(right panel). 




Pi Pu 


Figure 5: Upper and lower bounds on the credible intervals (dashed lines) plot¬ 
ted over the marginal posterior distribution of the fixed slope /3i (left) and of 
the correlation /5„ between the by-subject varying intercepts and varying slopes 
(right). 
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histograms are shown in Figure [SJ The information conveyed by such graphs 
can be sharpened by using the 95% credible interval, mentioned earlier. Ap¬ 
proximately 95% of the posterior density of fii lies between the 2.5 percentile 
—0.092 and the 97.5 percentile 0.023. This leads us to conclude that the slope 
Pi for relative clause type so is less than zero with probability 90% (see line 8 
in Listing [5]) . Since 0 is included in the credible interval, it is difficult to draw 
the inference that object relative clauses are read faster than subject relative 
clauses. However, one could perhaps still make a weak claim to that effect, es¬ 
pecially if a lot of evidence has accumulated in other experiments that supports 
such a conclusion (see [5S] for a more detailed discussion). 

What about the correlations between varying intercepts and varying slopes 
for subject and for item? What can we infer from the analysis about these 
relationships? The 95% credible interval for is (—1,0.1). Our belief that 
is less than zero is rather uncertain, although we can conclude that pu is less 
than zero with probability 90%. There is only weak evidence that subjects who 
read faster than average exhibit greater slowdowns at the head noun of object 
relative clauses than subjects who read slower than average. For the by-item 
varying intercepts and slopes, it is pretty clear that we do not have enough data 
(15 items) to draw any conclusions. For these data, it probably makes sense to 
fit a simpler model [5], with only varying intercepts and slopes for subject, and 
only varying intercepts for items; although there is no harm done here if we fit 
a model with a full variance-covariance matrix for both subjects and items. 

In sum, regarding our main research question, our conclusion here is that 
we cannot say that object relatives are harder to process than subject relatives, 
because the credible interval for /3i includes 0. However, one could argue that 
there is some weak evidence in favor of the hypothesis, since the posterior 
probability of the parameter being negative is approximately 90%. 


4 Example 2: Generalizing the linear mixed model 
to factorial designs 

The Gibson and Wu [5] data-set has a two-condition design. This section 
presents a varying intercepts, varying slopes model for a 2 x 2 factorial de¬ 
sign. Because of the more general matrix formulation we use here, the Stan 
code can be deployed with minimal changes for much more complex designs, 
including correlational studies. 

Our example is the 2x2 repeated measures factorial design of Husain et 
al [11] (Experiment 1), also a self-paced reading study on relative clauses. The 
dependent variable was the reading time rt of the relative clause verb. The 
factors were relative clause type, which we code with the predictor so (so = -1-1 
for object relatives and so = — 1 for subject relatives) and distance between the 
head noun and the relative clause verb, which we code with the predictor dist 
(dist = -|-1 for far and dist = —1 for near). Their interaction is the product 
of the dist and so contrast vectors, and labeled as the predictor int. The 60 
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subjects were speakers of Hindi, an Indo-Aryan language spoken primarily in 
India. The 24 items were presented in a standard, fully balanced Latin square 
design. This resulted in a total of 1440 data points (60 x 24 = 1440). The first 
few lines from the data frame are shown below. 


row 

subj 

item 

SO 

dist 

rt 

1 

1 

14 

s 

n 

1561 

2 

1 

16 

0 

n 

959 

3 

1 

15 

0 

f 

582 

4 

1 

18 

s 

n 

294 

5 

1 

4 

0 

n 

438 

6 

1 

17 

s 

f 

286 

1440 

9 

13 

s 

f 

516 


Table 5: The first six rows, and the last row, of the data-set of Husain et al. 
(2014, Experiment 1), as they appear in the data frame. 

The theoretical interest is in determining whether relative clause type and 
distance influence reading time, and whether there is an interaction between 
these two factors. We use Stan to determine the posterior probability distribu¬ 
tion of the hxed effect /3i for relative clause type, the fixed effect j 32 for distance, 
and their interaction (3^. 

We fit a varying intercepts, varying slopes model to this data-set. This is an 
extension of Model S] The grand mean /3o of logrt is adjusted by subject and 
by item through the varying intercepts mq and wq, which are unique values for 
each subject and item respectively. Likewise, the three fixed effects /3i, /32, and 
/Sa which are associated with the predictors so, dist, and int, respectively, are 
adjusted by the by-subject varying slopes mi, U 2 , and and by-item varying 
slopes u>i, W 2 , and W 3 . 

It is more convenient to represent this model in matrix form. We build up 
the model specification by first noting that, for each subject, the by-subject 
varying intercept uq and slopes ui, U 2 , and U 3 have a multivariate normal prior 
distribution with mean zero and covariance matrix I]„. Similarly, for each item, 
the by-item varying intercept wq and slopes wi, W 2 , and W 3 have a multivariate 
normal prior distribution with mean zero and covariance matrix We can 
write this as follows: 


( 

f 




/wo\ 

/ 


\ 

Ui 

~ N 

0 



Wi 

- N 

0 

5 

U2 

\U3) 

\ 

0 

Vo^ 



W2 

W3/ 

\ 

0 

Vo^ 

/ 


The error e is assumed to have a normal distribution with mean zero and 
standard deviation de. Thus, the varying intercepts, varying slopes here will be 
the same as Model IH just with two additional predictors dist, int along with 
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their associated fixed-effect slopes P 2 , P 3 and random-effect slopes U 2 , U 3 , W 2 , 
and W 3 . 

We proceed to implement the model in Stan. First we read in the data-set 
(see Listing!?]). Instead of passing the predictors so, dist, and their interaction 
int to Stan as vectors, as we did with so earlier, we make so, dist, and int 
into a design matrix X using the function model .matrix available in rH The 
first column of the design matrix X consists of all ones. The second column is the 
predictor so which codes the factor for relative clause type. The third column 
the predictor dist which codes the factor for distance. The fourth column is 
the predictor int which codes the interaction between relative clause type and 
distance. The model matrix thus consists of a fully factorial 2x2 design, with 
blocks of this design repeated for each subject. For the full data-set, we could 
write it very compactly in matrix form as follows: 

log(rt) = X/3 -I- ZuU -f Zuj-w -I- e (16) 

Here, X is the N x P model matrix (with N = 1440, since we have 1440 data 
points; and P = 4 since we have the intercept plus three other fixed effects), 
/3 is a P X 1 vector of fixed effects parameters, Z„ and Z^, are the subject 
and item model matrices {N x P), and u and w are the by-subject and by¬ 
item adjustments to the fixed effects estimates; these are identical to the design 
matrix X in the model with varying intercepts and varying slopes included. 
For more examples of similar model specifications in Stan, see the R package 
RePsychLing on github (https://github.com/dmbates/RePsychLing). 

Having defined the model, we proceed to assemble the list stanDat of data, 
relying on the above matrix formulation; please refer to Listing |7| The number 
N of observations, the number J of subjects and K of items, the reading times rt, 
and the subject and item indicator variables subj and item are familiar from 
the previous models presented. The integer P is the number of fixed effects (four 
including the intercept). Model [1^1 includes a varying intercept Uq and varying 
slopes ui, U 2 , U 3 for each subject, and so the number n_u of by-subject random 
effects equals P. Likewise, Model [T6| includes a varying intercept wq and varying 
slopes wi, W 2 , W 3 for each item, and so the number n_w of by-item random effects 
also equals P. The data block contains the corresponding variables. We declare 
the fixed effects design matrix X as an array of N row vectors whose components 
are the predictors associated with the N reading times. Likewise for the subject 
and item random effects design matrices Z_u and Z_w, which correspond to Z„ 
and Zyj respectively in Model [1^1 The vector beta contains the fixed effects 
/3o, /3i, P 2 , and ^ 3 . The matrices L_u, L_w and the arrays z_u, z_w of vectors 
(not to be confused with the design matrices Z_u and Z_w) will generate the 
varying intercepts and slopes uq, ..., U 3 and wq, ..., W 3 , using the procedure 
described for Model Sj The vector sigma_u contains the standard deviations of 
the by-subject varying intercepts and slopes uq, ..., U 3 , and the vector sigma_w 
contains the standard deviations of the by-item varying intercepts and slopes 

^Here, we would like to acknowledge the contribution of Douglas Bates in specifying the 
model in this general matrix form. 
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grand mean 


relative clause type 



6.50 6.60 6.70 6.80 


Po 

distance 



-0.10 -0.05 0.00 0.05 0.10 


P2 



-0.10 -0.05 0.00 0.05 0.10 


Pi 

interaction 



-0.10 -0.05 0.00 0.05 0.10 


Figure 6: Marginal posterior distribution and HPD intervals of the fixed effects 
grand mean /?o, slope /3i for relative clause type, slope P 2 for distance, and 
interaction /?3. All fixed effects are on the log-scale. 


rco, ■ • ■ , The variable sigma_e is the standard deviation < 7 ^ of the error 
£. The transformed parameters block generates the by-subject intercepts and 
slopes Mo, ..., M3 and the by-item intercepts and slopes icq, • ■ ■, wz- 

We place Ikj priors on the random effects correlation matrices through the 
lkj_corr_cholesky(2.0) priors on their Cholesky factors L_u and L_w. We 
implicitly place uniform priors on the fixed effects /3o, ..., /Ss, the random 
effects standard deviations (T„o, ■ • ■ j cr„3, and (Tu,o, ..., (Tu ,3 and the error stan¬ 
dard deviation ae by omitting any prior specifications for them in the model 
block. We specify the likelihood with the probability statement that rt [i] is 
distributed log-normally with mean X[i] * beta + Z_u[i] * u[subj[i]] + 
Z_w[i] * w[item[i]] and standard deviation sigma_e. The next step towards 
model-fitting is to pass the list stanDat to stan, which compiles a C-|— I- program 
to sample from the posterior distribution of the model parameters. 

Figure |6] plots histograms of the marginal posterior distribution of the fixed 
effects. The HPD interval of the fixed effect /3i for relative clause type is entirely 
below zero. This is evidence that object relatives are read faster than subject 
relatives. The HPD interval of the fixed effect $2 for distance is also entirely 
below zero. This is evidence of a slowdown when the verb (where reading time 
was measured) is closer to the head noun of the relative clause. The HPD of the 
interaction Ps between relative clause type and distance is greater than zero, 
which is evidence for a greater slowdown on subject relatives when the distance 
between the verb and head noun is short. 

A major advantage of the above matrix formulation is that we do not need 
to write a new Stan model for a future repeated measures factorial design. All 
we have to do now is define the design matrix X appropriately, and include it 
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(along with appropriately defined and for the subjects and items random 
effects) as part of the data specification that is passed to Stan. 

5 Concluding remarks, and further reading 

We hope that this tutorial has given the reader a flavor of what it would be like 
to fit Bayesian linear mixed models. There is of course much more to say on 
the topic, and we hope that the interested reader will take a look at some of 
the excellent books that have recently come out. We suggest below a sequence 
of reading that we found helpful. A good first general textbook is by Gelman 
and Hill [B]; it begins with the frequentist approach and only later transitions 
to Bayesian models. The forthcoming book by Mcelreath [20] is also excellent. 
For those looking for a psychology-specific introduction, the books by Kruschke 
m and Lee and Wagenmakers m are to be recommended, although for the 
latter the going might be easier if the reader has already looked at Gelman 
and Hill |B]. As a second book, m is recommended; it provides many inter¬ 
esting and useful examples using the BUGS language, which are discussed in 
exceptionally clear language. Many of these books use the BUGS syntax [18] . 
which the probabilistic programming language JAGS [23] also adopts; how¬ 
ever, Stan code for these books is slowly becoming available on the Stan home 
page (https://github.com/stan-dev/example-models/wiki). For those with in¬ 
troductory calculus, a slightly more technical introduction to Bayesian methods 
by Lynch m is an excellent choice. Finally, the textbook by Gelman and 
colleagues [^ is the definitive modern guide, and provides a more advanced 
treatment. 
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## read in data: 

rDat <- read.table ( "gibsonwu2012data.txt" , header=TRUE ) 

## subset critical region: 

rDat <- subset ( rDat , region == "headnoun" ) 

## create data as list for Stan, and fit model: 

stanDat <- list ( rt = rDat$rt, so = rDat$type, N = nrow(rDat) ) 
library(stan) 

fixEfFit <- stan ( file = "fixEf.stan" , data = stanDat , 
iter = 2000 , chains = 4 ) 

## plot traceplot, excluding warm-up: 
traceplot ( fixEfFit , pars = c ("beta",’’sigma_e") ^ 
inc_warmup = FALSE) 

## examine quantiles of posterior distributions: 
print ( fixEfFit , pars = c("beta","sigma_e") , 
probs = c(0.025,0.5,0.975)) 

## examine quantiles of parameter of interest: 

betal <- extract ( fixEfFit , pars=c("beta[2]"))$beta 

print ( signif { quantile { betal,probs = c(0.025,0.5,0.975)) 

, 2)) 


Listing 1: Code for the fixed effects Model 1. 


data { 


int<lower=l> N; 

//number of data points 

real rt[N]; 

//reading time 

real<lower=-l,upper=l> 

so[N]; //predictor 

parameters { 


vector[2] beta; 

//intercept and slope 

real<lower=0> sigma_e; 

//error sd 

model { 


real mu; 

for (i in 1:N){ 

// likelihood 

mu <- beta[l] + beta 

2]*so [i]; 

rt[i] ~ lognormal{mu, 

} 

} 

sigma_e); 


Listing 2: Stan code for the fixed effects Model 1. 
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## format data for Stan: 

stanDat<-list(subj=as.integer(factor(rDatSsubj )), 
item=as.integer(factor(rDat$item )), 
rt=rDatSrt, 
so=rDat$so, 

N=nrow(rDat ), 

J=length(unique(rDat$subj )), 

K=length(unique(rDat$item))) 

## Sample from posterior distribution: 
ranIntFit <- stan(file="ranlnt.stan", data=stanDat, 
iter=2000, chains=4) 

## Summarize results: 

print(ranIntFit,pars=c("beta" , "sigma_e","sigma_u","sigma_w"), 
probs=c(0.025,0.5,0,975)) 

betal <- extract(ranIntFit,pars=c("beta[2]"))$beta 
print(signif(quantile(betal,probs=c(0.025,0.5,0.975)),2)) 

## Posterior probability of betal being less than 0: 
mean (betal<0) 


Listing 3: Code for running Model 2, the varying intercepts model. 
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data { 

int<lower=l> N; 
real rt[N]; 

real<lower=-l,upper=l> so[N]; 
int<lower=l> J; 
int<lower=l> K; 

int<lower=l, upper=J> subj[N]; 
int<lower=l, upper=K> item[N]; 


//number of data points 
//reading time 
//predictor 
//number of subjects 
//number of items 
//subject id 
//item id 


parameters { 

vector[2] beta; 
vector[J] u; 
vector[K] w; 
real<lower=0> sigma_e; 
real<lower=0> sigma_u; 
real<lower=0> sigma_w; 

} 


//fixed intercept and slope 

//subject intercepts 

//item intercepts 

//error sd 

//subj sd 

//item sd 


model { 
real mu; 

//priors 

u ~ normal(0,sigma_u); //subj random effects 

w ~ normal(0,sigma_w); //item random effects 

// likelihood 
for (i in 1:N){ 

mu <- beta[l] + u[subj[i]] + w[item[i]] + beta[2]+so[i]; 
rt[i] ~ lognormal(mu,sigma_e); 



Listing 4: Stan code for running Model 2, the varying intercepts model. 


ranlntSlpFit <- stan(file="ranIntSlp.stan", data = stanDat, 
iter=2000, chains = 4) 

## posterior probability of beta 1 being less 
## than 0: 

betal <- extract(ranlntSlpFit,pars=c("beta[2]"))$beta 
print(signif(quantile(betal,probs=c(0.025,0.5,0,975)),2)) 
mean (betal<0) 


Listing 5: Code for running Model 3, the varying intercepts, varying slopes 
model. 
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data { 

int<lower=l> N; 
real rt[N]; 

real<lower=-l,upper=l> so[N]; 
int<lower=l> J; 
int<lower=l> K; 

int<lower=l, upper=J> subj[N]; 
int<lower=l, upper=K> item[N]; 


//number of data points 
//reading time 
//predictor 
//number of subjects 
//number of items 
//subject id 
//item id 


parameters { 

vector[2] beta; 
real<lower=0> sigma_e; 
vector<lower=0>[2] sigma_u; 
vector<lower=0>[2] sigma_w; 
cholesky_factor_corr[2] L_u; 
cholesky_factor_corr[2] L_w; 
matrix[2,J] z_u; 
matrix[2,K] z_w; 

} 


//intercept and slope 
//error sd 
//subj sd 
//item sd 


transformed parameters! 
matrix[2,J] u; 
matrix[2,K] w; 


u <- diag_pre_multiply(sigma_U/ L_u) 
w <- diag_pre_multiply(sigma_w, L_w) 


z_u; //subj random effects 
z_w; //item random effects 


model { 
real mu; 

//priors 

L_u ~ Ikj_corr_cholesky(2.0); 

L_w ~ Ikj_corr_cholesky(2.0); 
to_vector(z_u) ~ normal(0,1); 
to_vector(z_w) ~ normal(0,1); 

//likelihood 
for (i in 1:N){ 

mu <- beta[l] + u[l,subj[i]] + w[l,item[i]] 

+ (beta[2] + u[2,subj[i]] + w[2,item[i]])*so[i]; 
rt [i] ~ lognormal(mu,sigma_e); 

} 


Listing 6: The Stan code for Model 3, the varying intercepts, varying slopes 
model. 
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rDat<-read.table("HusainEtAlexptldata.txt",header=TRUE) 
rDat$subj <- with(rDat,factor(subj)) 
rDat$item <- with(rDat,factor(item)) 

X <- unname(model.matrix(~1+so+dist+int, rDat)) 

stanDat <- within(list(), 

{ 

N<-nrow(X) 

P <- n_u <- n_w <- ncol(X) 

X <- X 
Z_u <- X 
Z_w <- X 

J <- length(levels(rDat$subj)) 

K <- length(levels(rDat$item)) 
rt <- rDat$rt 

subj <- as.integer(rDat$subj) 
item <- as.integer(rDat$item) 

} 

) 

factorialFit <- stan (file=’’factorialModel. stan", 
data=stanDat , 
iter=2000, chains=4) 


Listing 7: Preparation of data for analyzing the Husain et al. data-set, and 
running the model. 
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data { 

int<lower=0> N; 
int<lower=l> P; 
int<lower=0> J; 
int<lower=l> n_u; 
int<lower=0> K; 
int<lower=l> n_w; 
int<lower=l,upper=J> subj[N]; 
int<lower=l,upper=K> item[N]; 
row_vector[P] X[N]; 
row_vector[n_u] Z_u[N]; 
row_vector[n_w] Z_w[N]; 
vector[N] rt; 


//no trials 

//no fixefs 

//no subjects 

//no subj ranefs 

//no items 

//no item ranefs 

//subject indicator 

//item indicator 

//fixef design matrix 

//subj ranef design matrix 

//item ranef design matrix 

//reading time 


parameters { 

vector[P] beta; //fixef coefs 

cholesky_factor_corr[n_u] L_u; //cholesky factor of subj ranef 

cholesky_factor_corr[n_w] L_w; //cholesky factor of item ranef 

vector<lower=0>[n_u] sigma_u; //subj ranef std 

vector<lower=0>[n_w] sigma_w; //item ranef std 

real<lower=0> sigma_e; //residual std 

vector[n_u] z_u[J]; //subj ranef 

vector[n_w] z_w[K]; //item ranef 


transformed parameters { 


vector [n_ 

_u] 

U[J] ; 


//subj 

ranefs 


vector [n_ 

_w] 

w [K] ; 


//item 

ranefs 


i 

matrix 

[n_ 

_u, n_u ] 

Sigma_u; 

//subj 

ranef cov 

matrix 

matrix 

[n_ 

_w, n_w ] 

Sigma_w; 

//item 

ranef cov 

matrix 


Sigma_u <- diag_pre_multiply(sigma_u,L_u); 
Sigma_w <- diag_pre_multiply(sigma_w,L_w); 
for(j in 1:J) 

u[j] <- Sigma_u * z_u[j]; 
for{k in 1:K) 

w[k] <- Sigma_w * z_w[k]; 


model { 

//priors 

L_u ~ Ikj_corr_cholesky(2.0); 

L_w ~ Ikj_corr_cholesky (2.0); 
for (j in 1:J) 

z_u[j] ~ normal(0,1); 
for (k in 1:K) 

z_w[k] ~ normal(0,1); 

//likelihood 
for (i in 1:N) 

rt[i] ~ lognormal(X[i] * beta + 

Z_u[i] * u[subj[i]] + 
Z_w[i] * w[item[i]], 
sigma_e); 

} 


Listing 8: Stan code for Husain et al data. 


corr natrix 
corr natrix 
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